Turn-Taking Estimation Model Based on Joint Embedding of Lexical and Prosodic Contents
Abstract
A natural conversation involves rapid exchanges of turns. Taking turns at appropriate times or intervals is a requisite feature for a dialog system acting as a conversation partner. This paper proposes a model that estimates the timing of turn-taking during verbal interactions. Unlike previous studies, our proposed model does not rely on a silence region between sentences, since a dialog system must respond without large gaps or overlaps. We propose a recurrent neural network (RNN)-based model that takes the joint embedding of lexical and prosodic contents as its input, classifies utterances into turn-taking-related classes, and estimates the turn-taking timing. To this end, we trained a neural network to embed the lexical contents, the fundamental frequencies, and the speech power into a joint embedding space. To learn meaningful embedding spaces, the prosodic features of each utterance are pretrained using an RNN and combined with the utterance's lexical embedding as the input of our proposed model. We tested this model on a spontaneous conversation dataset and confirmed that it outperformed the use of word embedding-based features.
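The pipeline described in the abstract can be pictured with a minimal sketch. This is not the authors' implementation: the abstract does not give layer types, sizes, class inventory, or how the two embeddings are combined, so the plain Elman RNN, the concatenation step, the dimensions, the three output classes, and all function names below are assumptions made for illustration. The weights are random and untrained, so the output probabilities carry no meaning beyond showing the data flow.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_rnn(input_dim, hidden_dim):
    """Random (untrained) parameters for a simple Elman RNN."""
    return {
        "W_x": rng.normal(0.0, 0.1, (hidden_dim, input_dim)),
        "W_h": rng.normal(0.0, 0.1, (hidden_dim, hidden_dim)),
        "b": np.zeros(hidden_dim),
    }

def run_rnn(params, inputs):
    """Run the RNN over inputs of shape (T, input_dim); return hidden states (T, hidden_dim)."""
    h = np.zeros(params["b"].shape[0])
    states = []
    for x in inputs:
        h = np.tanh(params["W_x"] @ x + params["W_h"] @ h + params["b"])
        states.append(h)
    return np.stack(states)

def softmax(z):
    """Numerically stable softmax over the last axis."""
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Hypothetical sizes; none of these values come from the paper.
LEX_DIM, PROS_DIM, HID_DIM, N_CLASSES = 16, 8, 12, 3

prosody_rnn = init_rnn(2, PROS_DIM)                  # per-frame input: [F0, power]
dialog_rnn = init_rnn(LEX_DIM + PROS_DIM, HID_DIM)   # consumes joint embeddings
W_out = rng.normal(0.0, 0.1, (N_CLASSES, HID_DIM))   # classification layer

def joint_embedding(lexical_vec, prosody_frames):
    """Encode the prosody frames with an RNN and concatenate with the lexical vector.

    Concatenation is an assumption; the abstract only says the two are combined.
    """
    prosodic_vec = run_rnn(prosody_rnn, prosody_frames)[-1]  # last hidden state
    return np.concatenate([lexical_vec, prosodic_vec])

def classify_utterances(utterances):
    """Map a sequence of (lexical_vec, prosody_frames) pairs to class probabilities."""
    joint = np.stack([joint_embedding(lex, pros) for lex, pros in utterances])
    hidden = run_rnn(dialog_rnn, joint)
    return softmax(hidden @ W_out.T)

# Toy dialog: three utterances with random features and different frame counts.
utterances = [
    (rng.normal(size=LEX_DIM), rng.normal(size=(n_frames, 2)))
    for n_frames in (20, 35, 12)
]
probs = classify_utterances(utterances)
print(probs.shape)  # one probability row per utterance
```

In this sketch each utterance gets one row of probabilities over the hypothetical turn-taking classes. A real system would train both RNNs and would need a separate mechanism for estimating timing, which the sketch does not attempt.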
Similar resources
Model Based Method for Determining the Minimum Embedding Dimension from Solar Activity Chaotic Time Series
Predicting the future behavior of a chaotic time series system is a challenging area in the literature on nonlinear systems. The prediction accuracy for chaotic time series depends heavily on the model and the learning algorithm. On the other hand, cyclic solar activity, as one of the natural chaotic systems, has significant effects on Earth, climate, satellites, and space missions. Several m...
Production of English Lexical Stress by Persian EFL Learners
This study examines the phonetic properties of lexical stress in English produced by Persian speakers learning English as a foreign language. The four most reliable phonetic correlates of English lexical stress, namely fundamental frequency, duration, intensity, and vowel quality, were measured across Persian speakers’ production of the stressed and unstressed syllables of five English disyllabi...
Turn-taking, feedback and joint attention in situated human-robot interaction
In this paper, we present a study where a robot instructs a human on how to draw a route on a map. The human and robot are seated face-to-face with the map placed on the table between them. The user’s and the robot’s gaze can thus serve several simultaneous functions: as cues to joint attention, turn-taking, level of understanding and task progression. We have compared this face-to-face setting...
Predicting User Satisfaction from Turn-Taking in Spoken Conversations
User satisfaction is an important aspect of the user experience while interacting with objects, systems or people. Traditionally, user satisfaction is evaluated a posteriori via spoken or written questionnaires or interviews. In automatic behavioral analysis, we aim at measuring the user's emotional states and their descriptions as they unfold during the interaction. In our approach, user satisfactio...
Turn-taking in Mandarin Dialogue: Interactions of Tone and Intonation
Fluent dialogue requires that speakers successfully negotiate and signal turn-taking. While many cues to turn change have been proposed, especially in multi-modal frameworks, here we focus on the use of prosodic cues to these functions. In particular, we consider the use of prosodic cues in a tone language, Mandarin Chinese, where variations in pitch height and slope additionally serve to deter...